Abstract: Over the last ten years, genome sequencing capabilities have expanded exponentially.
There have been tremendous advances in sequencing technology, DNA sample preparation,
genome assembly, and data analysis. These have led to advances in a number of facets of
bacterial genomics, including metagenomics, clinical medicine, bacterial archaeology, and
bacterial evolution. This review examines the strengths and weaknesses of current techniques
in bacterial genome sequencing, upcoming technologies, and assembly techniques, and
highlights recent studies that demonstrate new applications for bacterial genomics.
Keywords: bacterial, genome, sequencing, assembly, review

History of bacterial genome sequencing 

The first bacterial genome to be sequenced was that of Haemophilus influenzae, in 1995.
Since then, the Genomes Online Database has come to list 2,264 finished bacterial genomes
and 4,067 permanent draft genomes (genomes that are sequenced but not completely
closed). The majority of these have been deposited since 2008, after the commercial
introduction of high-throughput sequencing. A number of sequencing techniques
have subsequently been introduced, making bacterial genome sequencing significantly
cheaper and easier. The cost per megabase of sequence has decreased by five orders of
magnitude (see Figure 1), which has allowed sequencing of large numbers of genomes and
a move from sequencing individual genomes to sequencing multiple strains. However, the
general workflow of bacterial sequencing remains largely unchanged: sample preparation,
DNA sequencing, sequence assembly, and bioinformatic analysis. This review examines each
of these steps, as well as some of the current applications of these technologies.

Sample preparation 

The major advance in sample preparation has been more effective isolation and
amplification of small amounts of DNA, allowing genome sequencing from limited or
degraded initial samples. This includes the development of multiple displacement
amplification (MDA), an isothermal amplification technique that uses the phi29 DNA
polymerase combined with random hexamers to produce DNA fragments in the
multiple-kilobase range. This allows genomic-scale sequencing from small starting samples
of DNA. Based on studies in Anaplasma, sequencing after phi29 amplification appears to
provide genomic coverage and single nucleotide polymorphism (SNP) rates similar to those
of traditional sample preparation. While additional chimeric sequences (a single
sequence derived from two separate pieces of DNA) were generated, these did not
interfere with genome assembly. MDA has been used to sequence the genome from a single
unculturable intracytoplasmic symbiont of Draeculacephala minerva. This approach may
provide a significant amount of information on genome sequences from unculturable
bacteria, allowing whole genome sequencing rather than the limited information
available from metagenomic studies.

Sequencing technologies 

The biggest revolution in genomics over the last several years has been the emergence
of new sequencing technologies. These have shifted the bottleneck in genome sequencing
from generation of raw sequence to bioinformatic processing of samples. Each sequencing
technology has specific strengths and weaknesses, making selection of the appropriate
technique important for obtaining the desired experimental results. Tables 1 and 2 give
an overview of different sequencing technologies and some relative strengths and
weaknesses. Individual techniques are described below. However, these technologies are
continuously revised; the mean read length of pyrosequencing, for example, has grown
from approximately 150 bp to approximately 700 bp in the last five years. Consultation
with the sequencing center early in the planning stage of an experiment is helpful in
obtaining the best results, as the center can provide updates on the technologies in use
and tailor the sequencing runs to the needs of the experiment.

Current technologies 

Pyrosequencing (454) 

Pyrosequencing (454) (Roche Inc., Branford, CT, USA) uses a "sequencing by synthesis"
approach. Deoxynucleotides are added one at a time, and incorporation is detected by
converting the pyrophosphate released during deoxynucleotide incorporation into a light
signal that is read by the sequencer. Because of this, it tends to have difficulty with
homopolymeric tracts, as the difference in light intensity between progressively longer
nucleotide repeats becomes proportionally smaller. In general, the strengths of
pyrosequencing are its relatively long read lengths and rapid turnaround time, which make
it especially useful for de novo sequencing projects and for organisms
with large numbers of repeats or long repetitive regions.
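
The homopolymer limitation can be made concrete with a deliberately idealised model. The
sketch below assumes, as a simplification that is not taken from the platform documentation,
that the light signal is exactly proportional to the number of identical bases incorporated
in a single nucleotide flow and that there is no noise. The function name
relative_signal_step is hypothetical.

```python
def relative_signal_step(run_length: int) -> float:
    """Relative signal increase when a homopolymer grows from n to n + 1 bases.

    Assumes an idealised, noise-free signal proportional to run length,
    so the step is ((n + 1) - n) / n = 1 / n.
    """
    if run_length < 1:
        raise ValueError("run_length must be at least 1")
    return 1.0 / run_length


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        step = relative_signal_step(n)
        print(f"{n} -> {n + 1} bases: signal rises by {step:.1%}")
```

Under this assumption, telling one base from two means resolving a 100% change in signal,
whereas telling eight from nine means resolving a 12.5% change, which measurement noise can
more easily obscure.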

Sequencing by Oligo Ligation Detection

Sequencing by Oligo Ligation Detection (SOLiD) (Life Technologies Corporation, Grand
Island, NY, USA) uses a "sequencing by ligation" approach. Numerous degenerate 8-mers are
ligated to the single-stranded DNA (ssDNA) template; in each probe, two nucleotides are
specific for the strand being sequenced and the remaining six bases are degenerate. As the
probes are ligated



Table 1 An overview of current sequencing technologies

Platform             Run time       Sequence yield  Reported         Mean read        Paired  Template      Reads
                                    per run         accuracy         length           reads   DNA required  per run
Illumina MiSeq       27 hours       8 Gb            >85% above Q30   2 × 250 bp       Yes     100 ng-…      15 M
Illumina HiSeq 1500
  Rapid run          27-40 hours    60-90 Gb        >80% above Q30   2 × 150 bp       Yes     100 ng-…      300 M
  High output        8.5 days       300 Gb          >80% above Q30   2 × 100 bp       Yes     100 ng-…      1.5 B
Illumina HiSeq 2500
  Rapid run          27-40 hours    90-120 Gb       >80% above Q30   2 × 150 bp       Yes     100 ng-…      600 M
  High output        11 days        600 Gb          >80% above Q30   2 × 100 bp       Yes     100 ng-…      3 B
Illumina GAIIx       14 days        95 Gb           >80% above Q30   2 × 150 bp       Yes     100 ng-…      320 M
PacBio RS II         2 hours        230 Mb          Approx 86% (Q8)  Approx 4,500 bp  No      250 ng-…      50 k
Ion Torrent
  Ion 314 chip v2    2.3-3.7 hours  30-100 Mb       >90% above Q20   200-400 bp       Yes     100 ng-…      400-550 k
  Ion 316 chip v2    3-4.9 hours    300 Mb-1 Gb     >90% above Q20   200-400 bp       Yes     100 ng-…      2-3 M
  Ion 318 chip v2    4.4-7.3 hours  600 Mb-2 Gb     >90% above Q20   200-400 bp       Yes     100 ng-…      4-5.5 M
SOLiD 5500 W         2-7 days       80-160 Gb       90% above Q40    2 × 60 bp        Yes     10 ng-…       1.2 B
SOLiD 5500xl W       2-7 days       160-320 Gb      90% above Q40    2 × 60 bp        Yes     10 ng-…       2.4 B
454 GS FLX+          10-23 hours    450-700 Mb      Mostly >Q30      Up to 1 kb       Yes     700 ng-…      1 M
454 GS Jr            10 hours       35 Mb           Mostly >Q30      400 bp           Yes     700 ng-…      100 k


Notes: Illumina MiSeq, Illumina HiSeq 1500, Illumina HiSeq 2500, Illumina GAIIx (Illumina Inc., San Diego, CA, USA); Ion Torrent (Life Technologies Corporation, Grand Island, NY,
USA); PacBio RS II (Pacific Biosciences Inc., Menlo Park, CA, USA); SOLiD 5500 W, SOLiD 5500xl W (Sequencing by Oligo Ligation Detection) (Life Technologies Corporation,
Grand Island, NY, USA); 454 GS FLX+ and 454 GS Jr (Roche Inc., Branford, CT, USA). Q score = -10 log10(P), where P is the probability of an incorrect base call.
Abbreviation: DNA, deoxyribonucleic acid.
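
The Q score definition in the notes to Table 1 can be turned into a small conversion
helper. The sketch below is only an illustration of that formula; the function names are
hypothetical and are not taken from any sequencing software.

```python
import math


def error_prob_to_phred(p: float) -> float:
    """Q = -10 * log10(P), where P is the probability of an incorrect base call."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return -10.0 * math.log10(p)


def phred_to_error_prob(q: float) -> float:
    """Invert the definition: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    # Thresholds that appear in Table 1.
    for q in (8, 20, 30, 40):
        p = phred_to_error_prob(q)
        print(f"Q{q}: error probability {p:.4g}, per-base accuracy {100 * (1 - p):.2f}%")
```

Read this way, "above Q30" means fewer than one expected error per 1,000 base calls, while
Q8 corresponds to an error probability of roughly 0.16, in line with the mid-80s per-read
accuracy listed for PacBio.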
Table 2 Relative strengths and weaknesses of current sequencing technologies

Platform                Strengths                          Weaknesses
Illumina MiSeq          Low error rates                    Higher indel rates
                        Support for paired-end sequencing  Errors with GC-rich sequences
Illumina HiSeq          Low error rates                    Relatively short read lengths
                        Support for paired-end sequencing
PacBio                  Long read lengths                  SNP detection less sensitive due to
                        Detects DNA methylation            higher individual read error rate
The Ion Personal        SNP detection                      Bias with AT-rich regions
Genome Machine® (PGM™)
SOLiD                   High accuracy                      Short read lengths
                        Flexible configuration
454                     Read length                        Higher indel rates
                        Sequencing speed                   Difficulty sequencing homopolymeric tracts



Notes: Illumina MiSeq and HiSeq (Illumina Inc., San Diego, CA, USA); Ion Torrent (Life
Technologies Corporation, Grand Island, NY, USA); PacBio (Pacific Biosciences Inc., Menlo
Park, CA, USA); SOLiD (Sequencing by Oligo Ligation Detection) (Life Technologies
Corporation, Grand Island, NY, USA).

Abbreviations: DNA, deoxyribonucleic acid; AT, adenine and thymine; GC,
guanine and cytosine; SNP, single nucleotide polymorphism.

to the template, fluorescent dyes are cleaved off and detected by the sequencer. Every
nucleotide participates in two ligation reactions, which allows for error checking of
each read. This gives SOLiD an advantage for SNP detection, as it tends to be highly
reliable in SNP sequencing. However, since each nucleotide call is based on a combination
of two reads (termed "colorspace"), rather than on a nucleotide sequence with a quality
score, fully utilizing these data requires tools designed
for SOLiD sequences.
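
The two-base encoding can be illustrated with a short sketch. It assumes the commonly
described SOLiD scheme in which each color (0-3) is the bitwise XOR of 2-bit codes for two
adjacent bases (A=0, C=1, G=2, T=3), with the first base stored as a key. The function
names are hypothetical, and real colorspace tools also handle quality values and
ambiguous calls.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode_colorspace(seq: str) -> str:
    """Return the first base followed by one color (0-3) per adjacent base pair.

    Expects a non-empty string over A, C, G, T.
    """
    colors = [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(str(c) for c in colors)


def decode_colorspace(cs: str) -> str:
    """Rebuild the bases by walking the colors from the known first base."""
    bases = [cs[0]]
    for color in cs[1:]:
        previous = BASE_TO_BITS[bases[-1]]
        bases.append(BITS_TO_BASE[previous ^ int(color)])
    return "".join(bases)


if __name__ == "__main__":
    print(encode_colorspace("ATGGC"))   # reference read
    print(encode_colorspace("ATAGC"))   # true SNP: two adjacent colors change
    print(decode_colorspace("A3003"))   # one miscalled color: all later bases change
```

Because each base contributes to two colors, a genuine SNP alters two adjacent colors,
whereas an isolated color change is more likely a measurement error. The same property
means that a single color error corrupts every downstream base when decoding naively,
which is why colorspace-aware tools are needed.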

MiSeq and HiSeq

MiSeq and HiSeq (Illumina Inc., San Diego, CA, USA) machines use a "sequencing by
synthesis" technique, in which individual DNA molecules are attached to the surface of
flow cells and isothermal "bridging" amplification is used to amplify signals. These are
then sequenced using reversible fluorophore-labeled nucleotides, which are optically read
from each flow cell. While these machines have high accuracies and produce large amounts
of raw data, the individual read lengths tend to be shorter, which can be problematic for
genomes with large repeats. Illumina's Nextera sample preparation kit can allow for
template amounts as low as 50 ng, which can
be useful for organisms that are difficult to culture.

Figure 1 Cost per megabase of sequencing, from 2001 to 2012.
Adapted from the NIH NHGRI Genome Sequencing Program website (http://www.genome.gov/sequencingcosts/).
Abbreviations: NIH, National Institutes of Health; NHGRI, National Human Genome Research Institute.

The Ion Personal Genome Machine® (PGM™)

Ion Torrent Personal Genome Machine® (PGM™) (Life Technologies Corporation, Grand
Island, NY, USA) uses a "sequencing by synthesis" approach, measuring the hydrogen
ions released during deoxynucleotide incorporation. This is measured by semiconductors
in the disposable chips used by the machine for sequencing. As the sequencer has no
optical component, machine throughput can be increased by modifying the chips, without
additional modification of the sequencer, which has led to a tremendous increase in
throughput since the initial release. This also allows selection of a chip giving the
appropriate sequencing coverage for the desired application, which can make sequencing
more cost-effective. While accuracy is generally high, there are difficulties with
adenine-thymine (AT)-rich sequences, which can lead to gaps in coverage. In addition to
the PGM, Life Technologies has also released the Proton system, which allows for larger
chips with more sequence per run.
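
Choosing a chip for a target coverage comes down to simple arithmetic: expected coverage is
the sequence yield divided by the genome size. The sketch below is a rough planning aid
only. The 5 Mb genome and 30-fold target are hypothetical values chosen for illustration,
the yield is the lower bound listed for the Ion 316 chip in Table 1, and the calculation
ignores uneven coverage, barcoding overhead, and reads lost to quality filtering.

```python
def expected_coverage(yield_bases: float, genome_size_bases: float) -> float:
    """Average fold coverage if all sequenced bases map evenly to one genome."""
    if genome_size_bases <= 0:
        raise ValueError("genome_size_bases must be positive")
    return yield_bases / genome_size_bases


def genomes_per_run(yield_bases: float, genome_size_bases: float,
                    target_coverage: float) -> int:
    """Number of genomes of this size that one run can cover at the target depth."""
    if target_coverage <= 0:
        raise ValueError("target_coverage must be positive")
    return int(yield_bases // (genome_size_bases * target_coverage))


if __name__ == "__main__":
    chip_yield = 300e6   # 300 Mb, lower bound for the Ion 316 chip in Table 1
    genome_size = 5e6    # hypothetical 5 Mb bacterial genome
    print(expected_coverage(chip_yield, genome_size))
    print(genomes_per_run(chip_yield, genome_size, target_coverage=30))
```

With these illustrative numbers, a single run gives about 60-fold coverage of one genome,
or enough sequence for two genomes at a 30-fold target.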

PacBio RS II Single Molecule Real-time Sequencing

PacBio RS II Single Molecule Real-time Sequencing (SMRT) (Pacific Biosciences Inc.,
Menlo Park, CA, USA) uses a variation of "sequencing by synthesis", in which
fluorescent-labeled deoxynucleotides are added to a zero-mode waveguide (ZMW) with a DNA
polymerase embedded in the bottom. As deoxynucleotides are incorporated into the growing
strand, the fluorescent signals are read in real time by the sequencer. While the
accuracy of individual reads tends to be low (approximately 85%), errors tend to be
random rather than due to specific DNA features, so increased coverage allows for high
cumulative accuracy rates. The main advantage of PacBio is its long read length; while
mean read lengths tend to be approximately 5 kb, reads of more than 10 kb are not
uncommon. Also, as the machine observes the reaction in real time, it can detect some
base modifications, such as methylation, without additional reagents, because they alter
the deoxynucleotide incorporation time. In addition, experiments have been conducted to
sequence DNA without the
initial amplification step in library preparation.
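
The claim that random errors average out with coverage can be checked with a simple
binomial model. The sketch below is not how any PacBio consensus software works; it
assumes that reads err independently at each site with the same probability, that the
consensus is a strict majority vote, and, pessimistically, that all erroneous reads agree
on the same wrong base. The function name is hypothetical.

```python
from math import comb


def majority_vote_error(per_read_error: float, coverage: int) -> float:
    """Probability that a strict majority of `coverage` independent reads is wrong."""
    if coverage < 1:
        raise ValueError("coverage must be at least 1")
    e = per_read_error
    first_losing_count = coverage // 2 + 1  # smallest number of wrong reads that wins
    return sum(
        comb(coverage, k) * e**k * (1 - e) ** (coverage - k)
        for k in range(first_losing_count, coverage + 1)
    )


if __name__ == "__main__":
    for depth in (1, 5, 15, 31):
        print(f"{depth}x coverage: consensus error {majority_vote_error(0.15, depth):.3g}")
```

With a 15% per-read error rate, the modelled consensus error falls rapidly as coverage
increases. The same reasoning would not apply to systematic errors tied to sequence
context, since those recur in every read covering the site.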

Future technologies 

GnuBio 

The GnuBio (GnuBio Inc., Cambridge, MA, USA) sequencer uses a sequencing by
amplification approach, using microfluidics to combine target selection, DNA
amplification, library preparation, and sequencing into one instrument. While it is
targeted at clinical applications, it also has a number of applications for microbial
sequencing. Beta testing began
in April of 2013.

GridION/MinION

GridION/MinION (Oxford Nanopore Technologies Ltd, Oxford, UK) systems use nanopore
technology and disposable cartridges to perform a number of possible experiments,
including DNA sequencing. Sequence is read from the variation in electrical current
produced when ssDNA is fed through the nanopore by an enzyme. No amplification step is
necessary, allowing DNA sequence modifications (such as methylation) to be examined
directly. The GridION is meant to be an expandable, reusable system for core
laboratories, while the MinION is a single-use device for individual
laboratories.

Genome assemblers 

While raw sequence data is useful, it is significantly more valuable after assembly into
contiguous DNA sequences (contigs). There are a number of strategies for assembly, and
sequences can be assembled either de novo or against a reference sequence. A number of
assemblers have been used on bacterial genome sequences; some of the more
common options are discussed below in alphabetical order.

ABySS 

ABySS (Assembly By Short Sequences) is a de novo parallel paired-end assembler that
works with Illumina, SOLiD, pyrosequencing, and Sanger reads. It also works with
combinations of technologies by calculating the distribution of read sizes for each, so
an accurate empirical distribution can be obtained. In addition, it has been adapted for
transcriptome assembly with RNA-seq data.

Celera Assembler 

CABOG (Celera Assembler with the Best Overlap Graph) is a de novo assembler that was
first developed for the original human genome project. It has subsequently been modified
to assemble pyrosequencing, Illumina, and PacBio reads. While it is primarily geared
toward mammalian sequences, it can also be utilized for microbial sequences.

Edena 

Edena (Exact DE Novo Assembler) is a de novo, overlap graph-based short-read assembler.
It requires reads to be of similar length, as it is designed for Illumina-based
sequences; pyrosequencing and Sanger-based reads would therefore need to be trimmed to a
similar length to be processed. This program is specifically designed for bacterial
genome assemblies.

EULER-SR 

EULER-SR is a de novo assembler that uses an A-Bruijn graph technique to assemble
Sanger, pyrosequencing, and Illumina reads. It is geared toward assembly of DNA
sequences from individual organisms, as well as clustering
of sequences from metagenomic analyses.

MaSuRCA 

MaSuRCA (Maryland Super Read Cabog Assembler) is a new de novo genome assembler that
combines de Bruijn graph and overlap-layout-consensus approaches to increase efficiency.
It can use a combination of short (Illumina and SOLiD) and longer (pyrosequencing)
reads. This assembler performed best in a recent comparison of several modern
assemblers on a number of bacterial genomic data sets.

MIRA 

MIRA (Mimicking Intelligent Read Assembly) is a whole genome shotgun and expressed
sequence tag (EST) assembler for Sanger and pyrosequencing reads; the development
version also supports Illumina, Life Technologies, and PacBio reads. It can perform both
de novo and reference-based assemblies. It features sequence editors, allowing repair of
sequencing errors and use of quality data in generating assemblies. When assembling
against a reference sequence, it can also call SNPs and other mutations.

SOAP suite 

SOAPdenovo2 (Short Oligonucleotide Analysis Package) is made up of multiple modules
that perform error correction, assembly, paired-end mapping, and scaffold construction,
and is specifically designed for de novo assembly of Illumina reads. While it was
designed for large genomes, it has been tested on microbial genomes and works well on
them too. Separate programs, SOAP2 and SOAP3, align reads to reference genomes. In
addition to the assembler, the suite includes
tools for SNP and indel detection.

SOPRA 

SOPRA (Statistical Optimization of Paired Read Assembly) is a de novo assembler that
attempts to compensate for inaccuracies in high-throughput reads. It accepts
pyrosequencing, Illumina, and SOLiD reads, and can use data on mate-pair distances to
create scaffolds. It can convert SOLiD colorspace to base-space and use that for quality
checking. However, SOPRA requires contigs as input; the developers recommend Velvet as a
contig assembler, but the program
can use FASTA contigs generated by any program.

Velvet 

Velvet is a de Bruijn graph-based de novo assembler that can assemble Illumina, SOLiD,
pyrosequencing, and Sanger reads. In addition, if compiled with colorspace support, it
can perform colorspace assembly as well as base-space assembly. Velvet was one of the
first de Bruijn graph assemblers and has continued to be updated, including updates to
allow for
mixed-length assembly and paired-end assembly.
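
The de Bruijn graph idea behind assemblers such as Velvet can be shown with a toy example.
This is a teaching sketch, not Velvet's algorithm: it assumes error-free reads from a
single strand, ignores reverse complements and coverage, and simply follows unbranched
edges. The function names and the three example reads are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_de_bruijn(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Nodes are (k-1)-mers; every k-mer in a read adds an edge prefix -> suffix."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unbranched_path(graph: Dict[str, Set[str]], start: str) -> str:
    """Extend a contig from `start` while the current node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in visited:  # stop rather than loop around a cycle
            break
        contig += next_node[-1]
        visited.add(next_node)
        node = next_node
    return contig


if __name__ == "__main__":
    reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]  # overlapping toy reads
    graph = build_de_bruijn(reads, k=4)
    print(walk_unbranched_path(graph, "ATG"))
```

The three overlapping reads collapse into one path through the graph, giving a single
contig. Repeats longer than k - 1 bases create branches where this simple walk stops,
which is one reason longer reads and paired-end information help with repetitive genomes.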

Optical mapping 

In addition to traditional assembly of sequence reads into contigs, high-resolution
optical mapping has been combined with contig assembly to allow more rapid assembly of
contigs and determination of gap locations. Software is able to take the optical map and
arrange contigs, either with the assistance of a reference sequence or de novo. Whole
genome mapping has been used as a scaffold to perform the initial assembly of
pyrosequencing reads to better identify gaps in sequence coverage, allowing complete
genome assembly
without paired-end sequencing.

Accessory programs 

In addition to sequence assembly, there are a number of other 
computer programs that can be helpful in further processing 
sequence data. 

Trimmomatic 

Trimmomatic is useful for processing Illumina data. It screens libraries against a
number of quality parameters and performs adapter trimming, cropping, filtering based on
a minimum length, and conversion of quality scores. Sequences that do not meet quality
guidelines are automatically removed. Unlike a number of other methods for sequence
trimming, Trimmomatic is aware of paired-end data and
maintains the paired end links.
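
Why pair awareness matters can be seen in a small sketch of quality trimming. This is not
Trimmomatic's code or command-line interface: the sliding-window rule, the default
thresholds, and the function names are illustrative assumptions, and the sketch drops both
mates when either becomes too short, which is a simplification.

```python
from typing import List, Optional, Tuple

Read = Tuple[str, List[int]]  # (bases, per-base Phred quality scores)


def sliding_window_keep(quals: List[int], window: int = 4,
                        min_mean_q: float = 15.0) -> int:
    """Number of leading bases to keep: cut where a window's mean quality first drops."""
    for start in range(len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < min_mean_q:
            return start
    return len(quals)


def trim_pair(fwd: Read, rev: Read, min_len: int = 36) -> Optional[Tuple[Read, Read]]:
    """Trim both mates; return None if either falls below min_len after trimming."""
    trimmed = []
    for seq, quals in (fwd, rev):
        keep = sliding_window_keep(quals)
        trimmed.append((seq[:keep], quals[:keep]))
    if any(len(seq) < min_len for seq, _ in trimmed):
        return None  # drop both mates so forward and reverse files stay in step
    return trimmed[0], trimmed[1]


if __name__ == "__main__":
    forward = ("ACGTACGT", [30] * 8)
    reverse = ("ACGTACGT", [30, 30, 30, 30, 2, 2, 2, 2])
    print(trim_pair(forward, reverse, min_len=3))
    print(trim_pair(forward, reverse, min_len=4))
```

If each file were filtered independently, a read removed from one file would shift every
later read out of register with its mate, and downstream assemblers that rely on pairing
would silently use the wrong partners.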

CGView Comparison Tool 

CGView Comparison Tool (CCT) is a program for visually comparing multiple circular
genomes. It takes sequence alignment output and uses it to visualize the results against
a reference genome. One strength over many other tools is the ability to compare
thousands of
genomes in the same map.

Artemis Comparison Tool 

The Artemis Comparison Tool (ACT) is another tool for visualizing multiple genome
comparisons. For people who use Artemis for genome annotation, the user interface for
ACT is almost identical, making it easy to use. The interactive user interface makes it
useful for examination of genomes and SNP detection. In addition to the stand-alone
program, a web tool (WebACT) has also been developed for online
work.

Galaxy 

Galaxy is a web application that provides access to a variety of bioinformatics tools.
It is also extensible, so programmers can add support for nearly any desired
bioinformatics tool. While it started primarily as a method for working with text-based
data, such as DNA sequences, recent developments have added data visualization tools as
well. The main strength of Galaxy is the ability for multiple researchers to work on
data sets together via web browsers. In addition to sharing data sets, researchers can
also share workflows, allowing others to replicate their results and to edit and save
workflows for future use. There are public servers for Galaxy, but it can also be
downloaded and run locally or in the cloud
to use additional storage and computing resources.

Applications of genome sequencing 

Clinical medicine 

One recent development has been the application of high-throughput DNA sequencing to
clinical applications. As requirements for DNA template concentration and purity for
genome sequencing decrease, clinical samples can be directly sequenced, allowing for
organism identification and possible identification of traits such as antibiotic
resistance. While complete genome assembly is still time consuming, high-throughput
sequencing and assembly can reveal a tremendous amount of information about target
organisms without obtaining the complete genome sequence. This may also make large
numbers of clinical samples available for use in research studies, as complete genome
sequences of Chlamydia trachomatis have been obtained from swabs discarded
after testing.

In addition to rapidly determining genotype/phenotype associations, whole genome
sequencing (WGS) techniques have been used in several public health surveys, both to
analyze nosocomial infections in hospitals and differentiate them from non-outbreak
isolates, and in retrospective analyses to track the spread of infections through
hospitals. Whole genome sequencing gives the ability to determine where an infection was
acquired and has, in some cases, revealed previously unknown bacterial reservoirs. This
may be aided by the continued sequencing of multiple bacterial strains, as evidenced by
the determination of a minimum core genome for Streptococcus suis, which allowed
determination of genes unique to animal versus human strains. Other genomic analyses
have detected infection with multiple strains of the same organism, revealing previously
unknown transmission
events.
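
The core-genome comparison described above is, at its simplest, set arithmetic over gene
content. The sketch below illustrates only that idea; the strain names and gene
identifiers are invented placeholders, the function names are hypothetical, and real
analyses must first cluster orthologous genes across strains.

```python
from typing import Dict, Set


def core_genome(strain_genes: Dict[str, Set[str]]) -> Set[str]:
    """Genes present in every strain (requires at least one strain)."""
    return set.intersection(*strain_genes.values())


def group_specific_genes(strain_genes: Dict[str, Set[str]],
                         group: Set[str]) -> Set[str]:
    """Genes shared by all strains in `group` and absent from every other strain."""
    inside = [genes for strain, genes in strain_genes.items() if strain in group]
    outside = [genes for strain, genes in strain_genes.items() if strain not in group]
    return set.intersection(*inside) - set().union(*outside)


if __name__ == "__main__":
    # Placeholder data, not real Streptococcus suis gene content.
    strains = {
        "human_1": {"g1", "g2", "g3"},
        "human_2": {"g1", "g2", "g3", "g5"},
        "pig_1": {"g1", "g2", "g4"},
    }
    print(sorted(core_genome(strains)))
    print(sorted(group_specific_genes(strains, {"human_1", "human_2"})))
```

The estimate of a core genome usually shrinks as more strains are added, which is one
reason continued sequencing of multiple strains sharpens this kind of comparison.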

Another related activity is the use of microbial genomics and metagenomics for
forensics. This has been done for cases of bioterrorism, such as the anthrax letter
attack investigation, where the isolates from the letters were linked and were different
from those previously suspected in the investigation. Microbial sequencing may also be
used in the future for criminal investigation, as skin microbial populations are
relatively unique and can be used to identify items handled
by people up to two weeks previously.

However, the abundance of sequence information makes bioinformatics the bottleneck in
the utilization of sequences from clinical samples. Future developments may help
automate sequence assembly and annotation, as well as bacterial typing from whole genome
sequences, speeding
analysis.

Genomic archaeology 

In addition to clinical medicine, the reduction in DNA template requirements for
sequencing has produced profound developments in genomic archaeology. Medieval isolates
of Yersinia pestis from victims of the Black Death were sequenced using the Illumina
platform, yielding 93% genome coverage. This revealed that current isolates of Y. pestis
appear to be descended from the medieval strain, and that the virulence of the Black
Death organism does not appear to be due to bacterial genotype. In another study,
multiple ancient isolates of Mycobacterium leprae were sequenced from bone lesions and
compared to modern isolates. This was the first study to assemble a complete genome de
novo from ancient sequences, rather than use a modern reference sequence for
scaffolding. It allowed tracking of the spread of leprosy from ancient times to the
modern day, as well as drawing conclusions about why leprosy disappeared from Europe but
persists in many developing countries today. Other studies have examined the bacterial
composition of ancient dental calculus, allowing comparison of historical bacterial
populations with modern-day oral flora and examination of environmental factors
associated with dental disease. Finally, another study examined the bacterial
populations in waterlogged, preserved wood, which will aid
in preserving historic wrecks and establishing underwater
archaeological parks.

Metagenomics studies 

Sequencing advances have caused huge growth in the field of metagenomics, and
metagenomic studies have started to be exploited as sources of raw sequence for genome
projects. The increases in DNA sequencing throughput have allowed metagenomics studies
to shift from amplification of 16S ribosomal RNA genes to shotgun sequencing of the
entire DNA population of a sample. Generally, recovering genomes in this way depends on
a small number of microbial genomes predominating in the population, allowing for
assembly into complete or near-complete genomes. However, one study combined multiple
metagenomics studies from a population to assemble twelve near-complete or complete
genome sequences from low-prevalence populations.

Bacterial evolution 

With the large numbers of sequenced genomes now available, a variety of techniques
developed for other organisms, such as genome-wide association studies, have begun to be
applied to bacteria. In one study, Campylobacter strains from a variety of hosts were
examined to determine factors involved in host specificity. While most lineages were
able to switch hosts, some lineages were associated with specific hosts. These
associations were linked with vitamin biosynthesis genes, and cattle isolates were able
to grow better in vitamin B5-depleted media. Another study examined the microbiota in
patients with and without type 2 diabetes, finding a significant decrease in
butyrate-producing bacteria and an increase in opportunistic pathogens.

Other studies have examined a number of bacteria to determine changes associated with
the development of pathogenicity. One study found that pathogenic bacteria have smaller
genomes, with fewer ribosomal RNAs, fewer transcriptional regulators, and more genes for
toxins and DNA replication. Similar reductions are detected in experimental populations
over multiple generations. Other studies have examined multiple strains to correlate
phenotypic differences with polymorphisms and transcriptional differences in bacteria
that cannot be cultured. Still others have examined the rate of polymorphism formation
in multiple species, finding that SNPs can occur in non-random locations depending on
the nature of the mutation.

Future work will likely involve correlating genomic data with transcriptional
regulatory data, metabolic pathway reconstruction, and proteomics data. While the
ultimate goal would be to establish whole-cell models of bacterial systems,
the raw data to drive these models will still be complete,
edited bacterial genomes.

Conclusion 

Advances in sample preparation, DNA sequencing, and assembly technology have caused an
explosion in the number of sequenced bacterial genomes and are enabling new uses for
bacterial genome sequencing. As technology improves, the number of applications will
only increase, making it more important to understand the spectrum of available
technology. Collaboration will also become more important, making web
tools for manipulation of genomic data more useful.